Assignment: End-to-End Machine Learning Project

The material in this assignment is based on Chapter 2 of Hands-On Machine Learning with Scikit-Learn and TensorFlow, by Aurelien Geron.

What problem are we trying to solve and how will we solve it?

In our case, we will be trying to build a model of California housing prices using Census data. The primary goal: be able to predict the median housing price in any California district, using the data available in this dataset. This problem is an example of regression, where the prediction of our model (or its output) is a continuous variable. This is in contrast to classification, where the prediction of our model (or its output) is a class or group.

The data description can be found here: https://github.com/ageron/handson-ml2/tree/master/datasets/housing

The data file itself is in data/housing.csv, found in the ./data directory of this module.

The typical steps in such an analysis vary depending on the problem, but they usually include: getting the data, exploring it, engineering new features, splitting into train and test samples, scaling and transforming the features, fitting a model, and evaluating the fit.

We will go through all of these steps. We won't dwell on the details of the model - we will use it like a black box. Later on in the course, we will spend more time on the details.

In the code blocks below, we sometimes give hints on how to start. In other cases, we point you back to previous examples.

Task 1: Get the data and print the columns
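A minimal sketch of a starting point. The tiny stand-in frame below is just so the snippet runs on its own; in the assignment, load the real file from data/housing.csv instead.

```python
import pandas as pd

# In the assignment, load the real file with:
#   housing = pd.read_csv("data/housing.csv")
# A tiny stand-in frame is used here so the snippet runs on its own.
housing = pd.DataFrame({
    "median_income": [2.5, 3.8, 5.1],
    "median_house_value": [180000.0, 260000.0, 340000.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN"],
})

print(housing.columns.tolist())
```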

Task 2: Explore the data

As we did in chapter 1, we are going to want to explore the data.

  1. Look at a few rows of the dataset: use housing.head().
  2. Get some info about the names and types of the columns in the dataframe, the number of rows, and how much memory the dataframe takes up: use housing.info()
  3. Get some basic statistical info about the dataframe (mean, std, etc): use housing.describe()
  4. Get correlations among all of the columns: use housing.corr()
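The four steps above can be sketched as follows (again with a small stand-in frame; use your loaded housing dataframe instead). Note that on recent pandas versions, corr() on a frame with text columns needs numeric_only=True.

```python
import pandas as pd

# Stand-in for the frame loaded from data/housing.csv
housing = pd.DataFrame({
    "median_income": [2.5, 3.8, 5.1, 1.9],
    "median_house_value": [180000.0, 260000.0, 340000.0, 120000.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR OCEAN", "INLAND"],
})

print(housing.head())        # first few rows
housing.info()               # column names, dtypes, row count, memory usage
print(housing.describe())    # mean, std, quartiles for numeric columns
# numeric_only=True skips text columns like ocean_proximity
print(housing.corr(numeric_only=True))
```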

Task 3: Histogram each column

Can you do this using a simple for loop?
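One way to do it with a simple for loop, sketched on a stand-in frame (in a notebook you would drop the Agg backend line and just call plt.show()):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the real housing dataframe
housing = pd.DataFrame({
    "median_income": [2.5, 3.8, 5.1, 1.9, 4.2],
    "median_house_value": [180000.0, 260000.0, 340000.0, 120000.0, 300000.0],
})

# One histogram per numeric column, via a simple for loop
figs = []
for col in housing.select_dtypes("number").columns:
    fig, ax = plt.subplots()
    housing[col].hist(bins=5, ax=ax)
    ax.set_title(col)
    figs.append(fig)
```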

Task 4: Make a scatter plot

Our goal in this assignment: predict the median_house_value given all of the other data. "median_house_value" will be our label. All of the other columns are our features.

Scatter plot median_house_value vs median_income. We would expect these to be highly correlated.
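A sketch using the pandas plot method (stand-in data again; with the real dataframe, an alpha below 1 helps show density where points overlap):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import pandas as pd

# Stand-in for the real housing dataframe
housing = pd.DataFrame({
    "median_income": [2.5, 3.8, 5.1, 1.9],
    "median_house_value": [180000.0, 260000.0, 340000.0, 120000.0],
})

# Label vs the most promising feature
ax = housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.5)
```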

Task 5: Feature Engineering

Feature engineering refers to combining existing features to form new ones. These combinations might be simple (like the result of adding/subtracting/multiplying/etc.) or they could be more complex - like the results of a sophisticated analysis. The basic idea is to add information for each candidate data point, which will hopefully improve whatever model we end up using to perform our predictions.

In our case there are some obvious new features we can create.

  1. rooms_per_household
  2. bedrooms_per_household
  3. bedrooms_per_room
  4. people_per_household

Below we show how to make the first new feature. You should add the other 3.
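Making the first new feature is one line of column arithmetic (stand-in values shown; the pattern is the same on the real dataframe):

```python
import pandas as pd

# Stand-in rows with the columns the new features are built from
housing = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})

# First engineered feature: average number of rooms per household
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
```

The other three features follow the same ratio pattern from the columns above.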

Task 6: Make a categorical variable

Think about the stratified sampling that we did earlier, and note that by far the most correlated variable in our dataset is median_income. So when we split our data, we would like to know for sure that the median income distribution of our test sample is close to that of our train sample. Will this be true if we just randomly split the data? We already know that the answer is "not quite".

To test this, let's make a categorical variable called income_cat which describes median income.

We will have 5 categories, running from 1.0 (low) to 5.0 (high).

To see how to do this, refer to the previous example from the section titled "An example of the power of pandas", from the workbook "module0_intro/module0_2_more_python_and_ploty.ipynb". In that case, we divided the Power Plant dataset into labels High, Medium, and Low exhaust vacuum. Here we will use labels 1 through 5, defined as above.

Remember to insert this column into our housing dataframe. Use the column name "income_cat" for this.
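A sketch using pd.cut. The bin edges below are an assumption - one common choice for this dataset - so adjust them to whatever cut points your own analysis of median_income suggests:

```python
import numpy as np
import pandas as pd

# Stand-in incomes spanning the range of interest
housing = pd.DataFrame({"median_income": [0.9, 2.0, 3.5, 5.0, 8.3]})

# NOTE: these bin edges are an assumed choice, not prescribed by the assignment
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1.0, 2.0, 3.0, 4.0, 5.0],
)
```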

Task 7: Dealing with missing data

You could try:

  1. Removing all rows with any missing data
  2. Replacing the missing data with the mean of the column: NOTE: if you do this, you must get the means from the training set. Think about why this is the case.

Pick one! Simplest is dropping all rows with any missing data.
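Both options in one sketch (stand-in column with a missing value):

```python
import numpy as np
import pandas as pd

housing = pd.DataFrame({"total_bedrooms": [129.0, np.nan, 1106.0]})

# Option 1: drop any row with missing data
cleaned = housing.dropna()

# Option 2: fill with the TRAINING-set mean. Computing the mean on the full
# or test data would leak test-set information into the model.
train_mean = housing["total_bedrooms"].mean()   # compute this on the train split
filled = housing.fillna({"total_bedrooms": train_mean})
```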

Task 8: Train/Test Splitting

Our goal is to design an algorithm to predict housing prices. To test our model, we will want to split our data into two parts:

  1. Training sample: This is the sample we will train our model on.
  2. Testing sample: This is the unseen data that we will test our trained model on. Good performance on this sample indicates that our model generalizes well.

Use a split of 80% train and 20% test, and do stratified sampling based on the income category variable "income_cat" we made above.
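The stratify argument of train_test_split does the stratified sampling for us. A sketch on stand-in data (the bin edges here are the same assumed choice as in Task 6):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# 25 stand-in rows, 5 per income stratum
housing = pd.DataFrame({"median_income": [1.0, 2.0, 3.5, 5.0, 8.0] * 5})
housing["income_cat"] = pd.cut(
    housing["median_income"],
    bins=[0.0, 1.5, 3.0, 4.5, 6.0, float("inf")],
    labels=[1, 2, 3, 4, 5],
)

# 80/20 split, stratified so each income_cat appears in the same
# proportion in train and test
train_set, test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)
```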

Task 9: Feature Scaling and Transformation

We will use feature scaling as we did with the flight dataset. In this case, use MinMaxScaler. Remember: you need to use the training set to fit the transformer, and you need to use the transformer on both the training and test sets.

An example of how to do this for multiple columns is in DataSetPrep in the section Min-Max scaling and sci-kit learn estimators.

Remember that we do not use these techniques for categorical or object columns (something different will be done).

To figure out which columns are which, use the code below:

  print("housing column types:")
  print(housing.dtypes)

I ended up using the following columns for input to my minmax scaler:

["housing_median_age", "total_rooms", "total_bedrooms", "population", "households", "median_income", "rooms_per_household", "bedrooms_per_household", "bedrooms_per_room", "people_per_household"]
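The fit-on-train, transform-both pattern looks like this (sketched on two stand-in frames with a couple of numeric columns):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Stand-ins for the real train/test splits
train = pd.DataFrame({"median_income": [1.0, 3.0, 5.0],
                      "population": [300.0, 900.0, 1500.0]})
test = pd.DataFrame({"median_income": [2.0, 6.0],
                     "population": [600.0, 1800.0]})

num_cols = ["median_income", "population"]
scaler = MinMaxScaler()
# Fit on the TRAINING set only...
train_scaled = scaler.fit_transform(train[num_cols])
# ...then apply the same fitted transform to both sets.
# Test values can fall outside [0, 1]; that is expected.
test_scaled = scaler.transform(test[num_cols])
```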

Task 9b: More on Transformation: One-hot encoding

The 'ocean_proximity' variable is a text variable that we will want to one-hot encode. Look at "Dealing with Text Features" in DataSetPrep.

Task 10: Combining everything before fitting

Refer to the earlier workbook titled "Putting Humpty-Dumpty back together!"

After all of our above work we should have:

  1. two numpy arrays containing our "scaled" numerical features, one for our training sample and one for our testing sample
  2. two one-hot-encoded numpy arrays for our categorical variable, one for our training sample and one for our testing sample

We need to combine these so we have one training numpy array, and one testing numpy array. Along with each of these, we will have label arrays, made from the median_house_value column for the test and train samples.
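Stacking the two arrays column-wise is one np.hstack call per sample set (tiny stand-in arrays shown, with the label step indicated in a comment):

```python
import numpy as np

# Stand-in arrays with the shapes described above
train_scaled = np.array([[0.0, 0.2], [0.5, 1.0]])   # scaled numeric features
train_1hot = np.array([[1.0, 0.0], [0.0, 1.0]])     # one-hot categorical features

# Stack the columns side by side: one combined feature matrix per sample set
X_train = np.hstack([train_scaled, train_1hot])

# Labels come straight from the (unscaled) median_house_value column, e.g.:
#   y_train = train_set["median_house_value"].to_numpy()
# Repeat both steps for the test sample.
```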

Task 11: Fit the data and test the fit

As before, the fit model will be linear regression (we are using more than just a single feature, but it is still just linear regression). Test the fit both using RMSE and by plotting the difference (predicted - true label) vs true label - but only use the test data.
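The fit-and-evaluate pattern, sketched on synthetic arrays standing in for the combined matrices from Task 10:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for X_train/y_train and X_test/y_test
rng = np.random.default_rng(0)
X_train = rng.random((100, 3))
y_train = X_train @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(100)
X_test = rng.random((20, 3))
y_test = X_test @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(20)

model = LinearRegression()
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))   # evaluate on the TEST set only

# Residual plot: (predicted - true) vs true, e.g.
#   import matplotlib.pyplot as plt
#   plt.scatter(y_test, pred - y_test, alpha=0.5)
```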

Extra credit: Some extra stuff

If you are looking for more to do, there are three extra-credit parts below. Each counts as 1/3 of the total extra credit for this assignment.

Extra Credit Part 1:

  1. We probably should have done this first.... but how do we know that our fit improved our knowledge? Is there a simple predictor that we could have used instead? How about if we predict the price simply based on the mean (or the median) of all housing prices? Use the mean squared error to do this.
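The mean-baseline comparison is a few lines (stand-in label arrays shown):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Stand-ins for the train/test label arrays
y_train = np.array([120000.0, 180000.0, 260000.0, 340000.0])
y_test = np.array([150000.0, 300000.0])

# Naive baseline: always predict the training-set mean price
baseline = np.full_like(y_test, y_train.mean())
baseline_mse = mean_squared_error(y_test, baseline)
# Compare this against the MSE of the fitted model: if the model's MSE
# isn't clearly lower, the fit hasn't added much knowledge.
```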

Extra Credit Part 2:

Try another predictor from sklearn: RandomForestRegressor and/or DecisionTreeRegressor. Make sure you test the fit results (using mean_squared_error) on BOTH the training AND test sets!
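A sketch with RandomForestRegressor on synthetic stand-in data; the point of scoring both splits is that a much lower training error than test error signals overfitting:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for the real feature/label arrays
rng = np.random.default_rng(0)
X_train = rng.random((100, 3))
y_train = X_train[:, 0] * 100.0
X_test = rng.random((20, 3))
y_test = X_test[:, 0] * 100.0

forest = RandomForestRegressor(n_estimators=50, random_state=42)
forest.fit(X_train, y_train)

# Score BOTH splits: a large train/test gap signals overfitting
train_mse = mean_squared_error(y_train, forest.predict(X_train))
test_mse = mean_squared_error(y_test, forest.predict(X_test))
```

DecisionTreeRegressor drops in the same way, and typically shows an even larger gap.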

Extra Credit Part 3:

Making maps

This data is interesting since it has latitude and longitude. Previously we made world maps, but this depended on our data having tags which were country names. This is different. This will be more like a scatter-plot, but arranged on an existing map (primarily California). How do we do this?

Google: plotly map scatter

Take the code from the first example and modify it. Pick something interesting to use for the size of the scatter points.